DEAlgo is an R package that identifies potential artifacts in single-cell RNA sequencing data (including doublets, multiplets and local contamination) and suggests potential source(s) of these artifacts.
DEAlgo was developed with Seurat 4.9.9.9060 (https://satijalab.org/seurat/).
DEAlgo was published on bioRxiv in July 2024:
Current doublet detection and ambient RNA removal algorithms underestimate the complexity and heterogeneity of technical artifacts within single-cell analysis. Artifacts can occur at any stage, from incomplete cell dissociation to doublet/multiplet formation in droplet-based microfluidic or microwell systems. The standard preprocessing pipeline, which is well-established in single-cell analysis, recommends using ambient RNA removal and doublet detection algorithms sequentially. However, despite stringent implementations of these methods, we observed that a significant population of cells in our datasets remains enriched with cell-specific marker genes that are biologically incongruent. For example, subclusters of adipocytes were enriched for the endothelial marker gene PECAM1, which falsely enriched angiogenesis pathways. Similar issues have also been reported in other published pre-processed data. Due to a lack of robust filtering methods, researchers often resort to manual removal of contaminated cells, leading to biased outcomes.
In a real-world situation, multiplets and ambient RNA are not mutually exclusive. Instead, doublets/multiplets/cell fractions and ambient RNA can occur concurrently, both globally (i.e., across all cell types) and locally (i.e., cell-type dependent), resulting in single-cell artifacts. Therefore, doublet detection and ambient RNA effects, both global and local, should not be treated separately.
To address this, we designed scCLINIC: single-cell CLeaner: Identify aNd Interpret Contaminations, which systematically detects artifacts without any reference and ascertains the likely contaminant source for review and further curation. scCLINIC achieved higher AUPRC and AUROC scores when benchmarked against other doublet detection algorithms using a simulated dataset. We applied scCLINIC to publicly available datasets and identified artifacts that were not detected by other doublet detection algorithms. These artifacts uniquely detected by scCLINIC contributed to the enrichment of erroneous pathways, including enrichment of adipocyte stem/progenitor cell (ASPC) related pathways in immune cells of diabetic patients. We could also derive biological explanations for the contamination patterns detected by scCLINIC, such as increased platelet aggregation with monocytes and lymphocytes in COVID-19 patients.
(28/5/2024) Initial version
(06/6/2024)
1. Uses Differential Percentage Expression (DPE) to remove low-quality cells instead of a low IS score
2. Calculates the scCLINIC contamination score by averaging ES instead of summing IS scores
3. Classifies subclusters into different levels of contamination instead of a binary classification of contaminated or non-contaminated cells
4. Adds a new PlotContaminationPattern function to dissect the source of artifacts in each subcluster
Sys.setenv(GITHUB_PAT = '<your_github_pat>') # your own GitHub personal access token; never commit a real token to a README
remotes::install_github('JayShinLab/dealgolorg@V04062024.yx', auth_token = Sys.getenv('GITHUB_PAT'))
DEAlgo requires the following R packages:
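The package list itself is not reproduced at this point. The packages loaded alongside dealgolorg in the step-by-step examples of this README are dplyr, pracma, Seurat, ggplot2, tidyr, pheatmap, viridisLite, reshape2 and gridExtra. Assuming those are the dependencies meant here, a minimal sketch for checking which of them are missing from the local library could look like this (the variable names are illustrative):

```r
# Packages taken from the library() calls in this README (assumed to be the
# dependency list; the authoritative list is the package's DESCRIPTION file).
required_pkgs <- c("dplyr", "pracma", "Seurat", "ggplot2", "tidyr",
                   "pheatmap", "viridisLite", "reshape2", "gridExtra")

# requireNamespace() returns FALSE instead of raising an error when a
# package is not installed, so it is safe to use as a probe.
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```

Anything reported as missing can then be installed in the usual way, for example with install.packages(missing_pkgs) for packages available on CRAN.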
library(dealgolorg)
obj <- load_data()
# 1. Without user annotation
DEAlgo_Main_Function("kidney_test",obj,"~/DEAlgoManuscript/test/") # replace "~/DEAlgoManuscript/test/" with the directory in which to store the DEAlgo results
# 2. With user annotation (OverlapRatio = "CT.Park")
DEAlgo_Main_Function("kidney_manual_test",obj,"~/DEAlgoManuscript/test/", OverlapRatio = "CT.Park",CELLANNOTATION = TRUE)
library(dealgolorg)
library(dplyr)
library(pracma)
library(Seurat)
library(ggplot2)
library(tidyr)
library(pheatmap)
library(viridisLite)
library(reshape2)
library(gridExtra)
Name <- "kidney_test"
Input <- load_data()
Output <- "~/DEAlgoManuscript/test/"
filteredmatrix=NA
rawmatrix=NA
resol=0.8
overlapRatioList=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9)
OverlapRatio=0.5
ISThreshold=0
Cutoff=10
gene_n=150
obj <- STEP1A_GlobalMarkers(Input,Output,Name,resol)
obj <- STEP1B_MergingCluster(obj,Output,Name,resol,overlapRatioList,gene_n)
obj <- STEP1C_RecalculateGlobalMarkers_IdentityScore(obj,Output,Name,resol,OverlapRatio,gene_n)
obj <- STEP1D_FilterLowISCluster(obj,Output,Name,resol,OverlapRatio,ISThreshold)
STEP2A_Subcluster(obj,Output,Name,resol,OverlapRatio,gene_n)
obj <- STEP2B_ContaminationScore(obj,Output,Name,resol,OverlapRatio,gene_n,Cutoff,filteredmatrix,rawmatrix)
PlotContaminationPattern(obj,Output,Name,OverlapRatio)
library(dealgolorg)
library(dplyr)
library(pracma)
library(Seurat)
library(ggplot2)
library(tidyr)
library(pheatmap)
library(viridisLite)
library(reshape2)
library(gridExtra)
Name <- "kidney_manual_test"
Input <- load_data()
Output <- "~/DEAlgoManuscript/test/"
filteredmatrix=NA
rawmatrix=NA
resol="Manual"
OverlapRatio="CT.Park"
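ISThreshold=0 # needed by STEP1D_FilterLowISCluster below; same value as in the workflow without annotation
Cutoff=10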
gene_n=150
CELLANNOTATION = TRUE
# Start from the annotated object loaded as Input above
obj <- STEP1C_RecalculateGlobalMarkers_IdentityScore(Input,Output,Name,resol,OverlapRatio,gene_n,CELLANNOTATION = TRUE)
obj <- STEP1D_FilterLowISCluster(obj,Output,Name,resol,OverlapRatio,ISThreshold,CELLANNOTATION = TRUE)
STEP2A_Subcluster(obj,Output,Name,resol,OverlapRatio,gene_n,CELLANNOTATION = TRUE)
obj <- STEP2B_ContaminationScore(obj,Output,Name,resol,OverlapRatio,gene_n,Cutoff,filteredmatrix,rawmatrix,CELLANNOTATION = TRUE)
PlotContaminationPattern(obj,Output,Name,OverlapRatio,CELLANNOTATION = TRUE)
The scCLINIC workflow comprises two steps:
1. Detect cell clusters with low identity and exclude them from the analysis.
2. Identify artifact cell clusters and ascertain the contamination source(s).
Figure 1. Types of local soup effect and DEAlgo infrastructure designed to address those effects.
scCLINIC takes either a post-QC Seurat object produced by a standard single-cell pre-processing workflow (Step1A) or a post-clustering Seurat object, after cell-type clustering and removal of low-quality cells (Step1C).
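As a rough illustration of what a "post-QC Seurat object" might look like, the sketch below builds one from a raw count matrix using standard Seurat calls. The helper name make_post_qc_object and the thresholds (more than 200 detected genes, less than 5% mitochondrial counts, human "MT-" gene prefix) are assumptions borrowed from common Seurat practice; they are not values prescribed by scCLINIC, and the package may expect further pre-processing.

```r
# Hypothetical helper (not part of scCLINIC): build a QC-filtered Seurat
# object from a genes-by-cells count matrix. Thresholds are illustrative.
make_post_qc_object <- function(counts, project = "sample",
                                min_features = 200, max_percent_mt = 5) {
  obj <- Seurat::CreateSeuratObject(counts = counts, project = project)

  # Percentage of counts from mitochondrial genes
  # (assumes human gene symbols, which start with "MT-").
  obj[["percent.mt"]] <- Seurat::PercentageFeatureSet(obj, pattern = "^MT-")

  # Select cells by name so the thresholds are resolved inside this function.
  keep <- colnames(obj)[obj$nFeature_RNA > min_features &
                          obj$percent.mt < max_percent_mt]
  subset(obj, cells = keep)
}
```

With Seurat installed, a call such as Input <- make_post_qc_object(counts) would return an object that could then be tried as the Input argument of the Step1A entry point.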
Figure 2. Standard Seurat clustering with 0.8 resolution, and the nCount.
Figure 3A. threshold = 10% overlap